Generalized supervised meta-blocking
Abstract
Entity Resolution is a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced by Meta-blocking techniques that leverage the entity co-occurrence patterns inside blocks: first, pairs of candidate entities are weighted in proportion to their matching likelihood, and then, pruning discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used by any pruning algorithm. For higher effectiveness, new weighting schemes are examined as features. Through extensive experiments, we identify the best pruning algorithms, their optimal sets of features, as well as the minimum possible size of the training set.
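The pipeline summarized in the abstract (weight candidate pairs, score them with a probabilistic classifier, then prune) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy blocks and labels, the two features (common-block count and Jaccard similarity of block sets), the hand-written logistic regression, and the two pruning rules (a global 0.5 threshold and a per-entity top-k). All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def candidate_features(blocks):
    """Return {(a, b): [common_blocks, jaccard]} for every co-occurring pair."""
    entity_blocks = defaultdict(set)
    for block_id, entities in blocks.items():
        for entity in entities:
            entity_blocks[entity].add(block_id)
    features = {}
    for entities in blocks.values():
        for a, b in combinations(sorted(entities), 2):
            common = entity_blocks[a] & entity_blocks[b]
            union = entity_blocks[a] | entity_blocks[b]
            # Two illustrative weighting schemes used as features (assumed).
            features[(a, b)] = [float(len(common)), len(common) / len(union)]
    return features


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=2000, lr=0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - target
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def match_probability(model, x):
    """Probabilistic score of one candidate pair under the trained model."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def prune_by_threshold(scores, threshold=0.5):
    """Keep pairs whose score reaches a global threshold."""
    return {pair for pair, p in scores.items() if p >= threshold}


def prune_top_k_per_entity(scores, k=1):
    """Keep, for every entity, its k highest-scoring candidate pairs."""
    per_entity = defaultdict(list)
    for (a, b), p in scores.items():
        per_entity[a].append((p, (a, b)))
        per_entity[b].append((p, (a, b)))
    kept = set()
    for ranked in per_entity.values():
        kept.update(pair for _, pair in sorted(ranked, reverse=True)[:k])
    return kept


# Toy input (assumed): block id -> entities placed in that block.
blocks = {
    "b1": ["e1", "e2", "e3"],
    "b2": ["e1", "e2"],
    "b3": ["e3", "e4"],
    "b4": ["e3", "e4"],
    "b5": ["e2", "e4"],
}
features = candidate_features(blocks)

# Small labelled sample (assumed): 1 = match, 0 = non-match.
labelled = {("e1", "e2"): 1, ("e1", "e3"): 0, ("e3", "e4"): 1, ("e2", "e4"): 0}
model = train_logistic([features[p] for p in labelled], list(labelled.values()))

# Every candidate pair gets a score, so any score-based pruning rule applies.
scores = {pair: match_probability(model, x) for pair, x in features.items()}
kept_threshold = prune_by_threshold(scores)
kept_top_k = prune_top_k_per_entity(scores, k=1)
print(sorted(kept_threshold))
print(sorted(kept_top_k))
```

The point of the sketch is the separation of concerns: the classifier only produces a score per pair, and the pruning rule is a separate, swappable step that consumes those scores.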
Similar resources
Supervised Meta-blocking
Entity Resolution matches mentions of the same entity. Being an expensive task for large data, its performance can be improved by blocking, i.e., grouping similar entities and comparing only entities in the same group. Blocking improves the run-time of Entity Resolution, but it still involves unnecessary comparisons that limit its performance. Meta-blocking is the process of restructuring a blo...
Evaluating Blocking Probability in Generalized
Generalized connectors provide the capability to connect a single input to one or more outputs. Such networks play an important role in supporting any application that involves the distribution of information from one source to many destinations or many sources to many destinations. We present the first analytic model for evaluating blocking probability in generalized connectors. The model allows...
Semi-Supervised Learning via Generalized Maximum Entropy
Various supervised inference methods can be analyzed as convex duals of the generalized maximum entropy (MaxEnt) framework. Generalized MaxEnt aims to find a distribution that maximizes an entropy function while respecting prior information represented as potential functions in miscellaneous forms of constraints and/or penalties. We extend this framework to semi-supervised learning by incorpora...
Generalized mixture models, semi-supervised learning, and unknown class inference
In this paper, we discuss generalized mixture models and related semi-supervised learning methods, and show how they can be used to provide explicit methods for unknown class inference. After a brief description of standard mixture modeling and current model-based semi-supervised learning methods, we provide the generalization and discuss its computational implementation using three-stage expec...
Generalized Optimization Framework for Graph-based Semi-supervised Learning
We develop a generalized optimization framework for graph-based semi-supervised learning. The framework gives as particular cases the Standard Laplacian, Normalized Laplacian and PageRank based methods. We also provide a new probabilistic interpretation based on random walks and characterize the limiting behaviour of the methods. The random walk based interpretation allows us to explain dif...
Journal
Journal title: Proceedings of the VLDB Endowment
Year: 2022
ISSN: 2150-8097
DOI: https://doi.org/10.14778/3538598.3538611